{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vanishing & Exploding Gradients\n",
    "* gradients get smaller as algorithm progresses to lower layers. Eventually GD leaves lower weights virtually unchanged. so training never converges.\n",
    "* gradients can also grow out of control (often seen in RNNs).\n",
    "* [Significant paper](http://goo.gl/1rhAef) - using combo of logistic sigmoid activiation with random weight initialization (normal, mean=0, stdev=1) -- output variance was >> input variance.\n",
    "* logistic activation: function saturates at 0 or 1 with derivative very close to 0 ==> so backpropagation has no gradient to use.\n",
    "![sigmoid](pics/sigmoid.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Xavier & He Initialization\n",
    "* For signals to flow properly in both directions, each layer's output variance should equal its input variance.\n",
    "* Recommends initializing connection weights with random settings using #ins, #outs\n",
    "![init parameters](pics/init-params-for-activation-funcs.png)\n",
    "* Default: *fully_connected()* function uses Xavier initialization w/ uniform distribution. Change to He initialization by using *variance_scaling_initializer()* function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow.contrib.layers import fully_connected\n",
    "\n",
    "n_inputs = 28*28\n",
    "n_hidden1 = 300\n",
    "\n",
    "X = tf.placeholder(tf.float32, shape=(None, n_inputs), name=\"X\")\n",
    "\n",
    "he_init = tf.contrib.layers.variance_scaling_initializer()\n",
    "hidden1 = fully_connected(X, n_hidden1, weights_initializer=he_init, scope=\"h1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-Saturating Activation Functions\n",
    "* ReLU activations suffer from *dying ReLU* problem (they stop emitting anything other than zero).\n",
    "* Workaround: the **leaku ReLU**. Alpha defines leakage; typical set to 0.01.\n",
    "![leaku ReLU](pics/leaky-relu.png)\n",
    "* Also: **randomized leaky ReLU (RReLU)** (randomized alpha)\n",
    "* Also: **parametric leaky RuLE (PReLU)** (alpha can be modified during backprop)\n",
    "* Also: **exponential linear unit (ELU)**. Allows negative values when z<0; non-zero gradient for z<0 (avoids dying units issue); smooth function everywhere. Uses exponential function, so harder to compute. [paper](http://goo.gl/Sdl2P7)\n",
    "![exponential-relu](pics/exponential-relu.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# TF doesn't have leaky ReLU predefined, but easy to build.\n",
    "\n",
    "def leaky_relu(z, name=None):\n",
    "    return tf.maximum(0.01 * z, z, name=name)\n",
    "\n",
    "hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch Normalization\n",
    "* [proposed](https://goo.gl/gA4GSP) to solve vanishing/exploding gradients. \n",
    "* Idea: pror to activation function,\n",
    "    1) zero-center & normalize inputs\n",
    "    2) scale & shift result with 2 new params per layer\n",
    "* Net effect: model learns optimal scale & mean of inputs for each layer\n",
    "* Algorithm:\n",
    "![batch normalization](pics/batch-normalization.png)\n",
    "\n",
    "* Does add computational complexity. Consider plain ELU + He initializaton as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Batch Normalization with TF\n",
    "* *batch_normalization()* - centers & normalizes inputs\n",
    "* *batch_norm()* - above, plus finds mean, stdev, scaling, offset params\n",
    "* call directly or include it in *fully_connected()* arguments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.\n",
      "Extracting /tmp/data/train-images-idx3-ubyte.gz\n",
      "Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.\n",
      "Extracting /tmp/data/train-labels-idx1-ubyte.gz\n",
      "Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.\n",
      "Extracting /tmp/data/t10k-images-idx3-ubyte.gz\n",
      "Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.\n",
      "Extracting /tmp/data/t10k-labels-idx1-ubyte.gz\n"
     ]
    }
   ],
   "source": [
    "# use MNIST dataset again\n",
    "from tensorflow.examples.tutorials.mnist import input_data\n",
    "mnist = input_data.read_data_sets(\"/tmp/data/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#setup\n",
    "\n",
    "import tensorflow as tf\n",
    "from tensorflow.contrib.layers import batch_norm, fully_connected\n",
    "\n",
    "tf.reset_default_graph() \n",
    "\n",
    "n_inputs = 28 * 28\n",
    "n_hidden1 = 300\n",
    "n_hidden2 = 100\n",
    "n_outputs = 10\n",
    "learning_rate = 0.01\n",
    "\n",
    "X = tf.placeholder(tf.float32, shape=(None, n_inputs), name=\"X\")\n",
    "y = tf.placeholder(tf.int64, shape=(None), name=\"y\")\n",
    "\n",
    "def leaky_relu(z, name=None):\n",
    "  return tf.maximum(0.01 * z, z, name=name)\n",
    "\n",
    "# is_training: tells batch_norm() whether to use current minibatch's mean & stdev \n",
    "# (found during training) or use running avgs (during testing)\n",
    "\n",
    "with tf.name_scope(\"dnn\"):\n",
    "    hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu, scope=\"hidden1\")\n",
    "    hidden2 = fully_connected(hidden1, n_hidden2, activation_fn=leaky_relu, scope=\"hidden2\")\n",
    "    logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope=\"outputs\")\n",
    "\n",
    "with tf.name_scope(\"loss\"):\n",
    "    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)\n",
    "    loss = tf.reduce_mean(xentropy, name=\"loss\")\n",
    "\n",
    "with tf.name_scope(\"train\"):\n",
    "    optimizer = tf.train.GradientDescentOptimizer(learning_rate)\n",
    "    training_op = optimizer.minimize(loss)\n",
    "\n",
    "with tf.name_scope(\"eval\"):\n",
    "    correct = tf.nn.in_top_k(logits, y, 1)\n",
    "    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))\n",
    "\n",
    "init = tf.global_variables_initializer()\n",
    "saver = tf.train.Saver()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 Train accuracy: 0.6 Test accuracy: 0.642\n",
      "1 Train accuracy: 0.73 Test accuracy: 0.7824\n",
      "2 Train accuracy: 0.81 Test accuracy: 0.827\n",
      "3 Train accuracy: 0.84 Test accuracy: 0.8539\n",
      "4 Train accuracy: 0.8 Test accuracy: 0.8686\n",
      "5 Train accuracy: 0.87 Test accuracy: 0.8759\n",
      "6 Train accuracy: 0.85 Test accuracy: 0.8843\n",
      "7 Train accuracy: 0.91 Test accuracy: 0.8903\n",
      "8 Train accuracy: 0.86 Test accuracy: 0.8969\n",
      "9 Train accuracy: 0.91 Test accuracy: 0.9018\n",
      "10 Train accuracy: 0.91 Test accuracy: 0.9014\n",
      "11 Train accuracy: 0.86 Test accuracy: 0.9065\n",
      "12 Train accuracy: 0.88 Test accuracy: 0.9078\n",
      "13 Train accuracy: 0.87 Test accuracy: 0.91\n",
      "14 Train accuracy: 0.93 Test accuracy: 0.911\n",
      "15 Train accuracy: 0.9 Test accuracy: 0.9123\n",
      "16 Train accuracy: 0.91 Test accuracy: 0.9141\n",
      "17 Train accuracy: 0.9 Test accuracy: 0.9149\n",
      "18 Train accuracy: 0.92 Test accuracy: 0.9159\n",
      "19 Train accuracy: 0.93 Test accuracy: 0.9174\n"
     ]
    }
   ],
   "source": [
    "n_epochs = 20\n",
    "batch_size = 100\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    \n",
    "    init.run()\n",
    "    \n",
    "    for epoch in range(n_epochs):\n",
    "        \n",
    "        for iteration in range(len(mnist.test.labels)//batch_size):\n",
    "            \n",
    "            X_batch, y_batch = mnist.train.next_batch(batch_size)\n",
    "            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})\n",
    "            \n",
    "        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})\n",
    "        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})\n",
    "        \n",
    "        print(epoch, \"Train accuracy:\", acc_train, \"Test accuracy:\", acc_test)\n",
    "\n",
    "    save_path = saver.save(sess, \"my_model_final.ckpt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gradient Clipping\n",
    "* to limit exploding gradients problem. Clip during backprop.\n",
    "* Typical use case: recurrent NNs. \n",
    "* [source](http://goo.gl/dRDAaf)\n",
    "* Uses TF *minimize()* function in optimizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "threshold = 1.0\n",
    "\n",
    "optimizer = tf.train.GradientDescentOptimizer(\n",
    "    learning_rate)\n",
    "\n",
    "grads_and_vars = optimizer.compute_gradients(\n",
    "    loss)\n",
    "\n",
    "capped_gvs = [\n",
    "    (tf.clip_by_value(\n",
    "        grad, -threshold, threshold), var)\n",
    "    for grad, var in grads_and_vars]\n",
    "\n",
    "training_op = optimizer.apply_gradients(capped_gvs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pretrained Layers & Reuse\n",
    "* best practice: look for existing NN that tackles similar task, then reuse lower layers (aka *transfer learning*)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Reuse with TF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reusing Models from Other Frameworks\n",
    "* Requires manual loading of weights (ex: Theano)\n",
    "* Very tedious"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\noriginal_w = [] # Load the weights from the other framework\\noriginal_b = [] # Load the biases from the other framework\\n\\nX = tf.placeholder(tf.float32, shape=(None, n_inputs), name=\"X\")\\nhidden1 = fully_connected(X, n_hidden1, scope=\"hidden1\")\\n\\n[...] # # Build the rest of the model\\n\\n# Get a handle on the variables created by fully_connected()\\n\\nwith tf.variable_scope(\"\", default_name=\"\", reuse=True): # root scope\\n    hidden1_weights = tf.get_variable(\"hidden1/weights\")\\n    hidden1_biases = tf.get_variable(\"hidden1/biases\")\\n\\n# Create nodes to assign arbitrary values to the weights and biases\\noriginal_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden1))\\noriginal_biases = tf.placeholder(tf.float32, shape=(n_hidden1))\\n\\nassign_hidden1_weights = tf.assign(hidden1_weights, original_weights)\\nassign_hidden1_biases = tf.assign(hidden1_biases, original_biases)\\n\\ninit = tf.global_variables_initializer()\\n\\nwith tf.Session() as sess:\\n    sess.run(init)\\n    sess.run(\\n        assign_hidden1_weights, \\n        feed_dict={original_weights: original_w})\\n        \\n    sess.run(\\n        assign_hidden1_biases, \\n        feed_dict={original_biases: original_b})\\n        \\n    [...] # Train the model on your new task\\n'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'''\n",
    "original_w = [] # Load the weights from the other framework\n",
    "original_b = [] # Load the biases from the other framework\n",
    "\n",
    "X = tf.placeholder(tf.float32, shape=(None, n_inputs), name=\"X\")\n",
    "hidden1 = fully_connected(X, n_hidden1, scope=\"hidden1\")\n",
    "\n",
    "[...] # # Build the rest of the model\n",
    "\n",
    "# Get a handle on the variables created by fully_connected()\n",
    "\n",
    "with tf.variable_scope(\"\", default_name=\"\", reuse=True): # root scope\n",
    "    hidden1_weights = tf.get_variable(\"hidden1/weights\")\n",
    "    hidden1_biases = tf.get_variable(\"hidden1/biases\")\n",
    "\n",
    "# Create nodes to assign arbitrary values to the weights and biases\n",
    "original_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden1))\n",
    "original_biases = tf.placeholder(tf.float32, shape=(n_hidden1))\n",
    "\n",
    "assign_hidden1_weights = tf.assign(hidden1_weights, original_weights)\n",
    "assign_hidden1_biases = tf.assign(hidden1_biases, original_biases)\n",
    "\n",
    "init = tf.global_variables_initializer()\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    sess.run(init)\n",
    "    sess.run(\n",
    "        assign_hidden1_weights, \n",
    "        feed_dict={original_weights: original_w})\n",
    "        \n",
    "    sess.run(\n",
    "        assign_hidden1_biases, \n",
    "        feed_dict={original_biases: original_b})\n",
    "        \n",
    "    [...] # Train the model on your new task\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Freezing Lower Layers\n",
    "* If 1st DNN already learned low-level features, try to reuse them by freezing the weights.\n",
    "* simplest way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\ntrain_vars = tf.get_collection(\\n    tf.GraphKeys.TRAINABLE_VARIABLES,\\n    scope=\"hidden[34]|outputs\")\\n\\n# minimizer can\\'t touch layers 1,2 - they\\'re \"frozen\"\\n\\ntraining_op = optimizer.minimize(\\n    loss, \\n    var_list=train_vars)\\n'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# provide all trainable var in hidden layers 3,4 & outputs to optimizer function\n",
    "# (this omits vars in hidden layers 1,2)\n",
    "'''\n",
    "train_vars = tf.get_collection(\n",
    "    tf.GraphKeys.TRAINABLE_VARIABLES,\n",
    "    scope=\"hidden[34]|outputs\")\n",
    "\n",
    "# minimizer can't touch layers 1,2 - they're \"frozen\"\n",
    "\n",
    "training_op = optimizer.minimize(\n",
    "    loss, \n",
    "    var_list=train_vars)\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Caching Lower Layers\n",
    "* Huge speed boost!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'import numpy as np\\n\\nn_epochs = 100\\nn_batches = 500\\n\\nfor epoch in range(n_epochs):\\n    shuffled_idx = rnd.permutation(\\n        len(hidden2_outputs))\\n    \\n    hidden2_batches = np.array_split(\\n        hidden2_outputs[shuffled_idx], \\n        n_batches)\\n    \\ny_batches = np.array_split(\\n    y_train[shuffled_idx], \\n    n_batches)\\n\\nfor hidden2_batch, y_batch in zip(hidden2_batches, y_batches):\\n    sess.run(\\n        training_op, \\n        feed_dict={hidden2: hidden2_batch, y: y_batch})\\n'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'''import numpy as np\n",
    "\n",
    "n_epochs = 100\n",
    "n_batches = 500\n",
    "\n",
    "for epoch in range(n_epochs):\n",
    "    shuffled_idx = rnd.permutation(\n",
    "        len(hidden2_outputs))\n",
    "    \n",
    "    hidden2_batches = np.array_split(\n",
    "        hidden2_outputs[shuffled_idx], \n",
    "        n_batches)\n",
    "    \n",
    "y_batches = np.array_split(\n",
    "    y_train[shuffled_idx], \n",
    "    n_batches)\n",
    "\n",
    "for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):\n",
    "    sess.run(\n",
    "        training_op, \n",
    "        feed_dict={hidden2: hidden2_batch, y: y_batch})\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tweaking/Dropping/Replacing Upper Layers\n",
    "* original output layer: should be replaced (little chance of reuse)\n",
    "* iterative freeze/train/compare process to see how many upper layers needed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Zoos\n",
    "* When you want to find a net already trained on a similar task\n",
    "* [TensorFlow Model Zoo](https://github.com/tensorflow/models)\n",
    "* [Caffe Model Zoo](https://goo.gl/XI02X3) - converter on [github](https://github.com/ethereon/caffe-tensorflow)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unsupervised pre-training\n",
    "* Tough problem, but doable.\n",
    "* Train layers one-by-one, starting with lowest layer\n",
    "* Freeze completed layers & train next layer on previous results\n",
    "![unsupervised pretraining](pics/unsupervised-pretraining.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-training on easily labeled data - reuse lower layers for \"real\" task\n",
    "* Often required due to cost/availability of large labeled datasets\n",
    "* Common tactic: label all training data as \"good\", generate & corrupt additional instances, label new ones as \"bad."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Faster Optimizers\n",
    "* Training speedup strategies thus far: 1) smart weight initializations, 2) smart activation functions, 3) batch normalization, 4) reuse of pretraining.\n",
    "* Better optimizer choices:\n",
    "    * Momentum optimization\n",
    "    * Nesterov Accelerated Gradients\n",
    "    * AdaGrad\n",
    "    * RMSProp\n",
    "    * Adam (should almost always use this one)\n",
    "    \n",
    "* Worth noting: below techniques rely on 1st-order partial derivatives (Jacobians); more techniques in literature use 2nd-order derivs (Hessians). Not viable for most deep learning due to memory & computational requirements."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Momentum optimization\n",
    "* local gradient added to a **momentum vector** (m) multiplied by learning rate (n)\n",
    "* ie, **gradient used as an accelerant - not as a speed.**\n",
    "* *beta* hyperparameter serves as friction mechanism. 0 = high friction, 1 = no friction.\n",
    "* Momentum optimization escapes plateaus much faster than GD."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# in TF\n",
    "\n",
    "optimizer = tf.train.MomentumOptimizer(\n",
    "    learning_rate=learning_rate,\n",
    "    momentum=0.9)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Nesterov Accelerated Gradient\n",
    "* idea: measure cost function gradient *slightly ahead in direction of momentum*.\n",
    "![nesterov vs regular momentum optimization](pics/nesterov.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# in TF\n",
    "\n",
    "optimizer = tf.train.MomentumOptimizer(\n",
    "    learning_rate=learning_rate,\n",
    "    momentum=0.9, \n",
    "    use_nesterov=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### AdaGrad\n",
    "* Scales gradient vector along steepest dimensions, ie it decays the learning rate faster for steep dimensions. (ie *adaptive learning rate*)\n",
    "* Works on simple quadratic problems but often stops too early.\n",
    "![adagrad vs gradient descent](pics/adagrad.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### RMSProp\n",
    "* Fixes AdaGrad problem by accumulating most recent gradients (instead of all). \n",
    "* Better than AdaGrad on all but very simple problems. Also better than MO and Nesterov."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# in TF\n",
    "\n",
    "optimizer = tf.train.RMSPropOptimizer(\n",
    "    learning_rate=learning_rate,\n",
    "    momentum=0.9, \n",
    "    decay=0.9, \n",
    "    epsilon=1e-10)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adam Optimization ([paper:](https://goo.gl/Un8Axa))\n",
    "* Keeps track of decaying past gradients (like Momentum Optimization)\n",
    "* Keeps track of decaying past squared gradients (like RMSProp)\n",
    "\n",
    "* Default params in TF:\n",
    "* Momentum decay param (beta1) usually set to 0.9\n",
    "* Scaling decay param (beta2) usually set to 0.999\n",
    "* Smoothing term (epsilon) usually set to 10e-8"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# in TF\n",
    "\n",
    "optimizer = tf.train.AdamOptimizer(\n",
    "    learning_rate=learning_rate)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Learning Rate Scheduling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# in TF\n",
    "\n",
    "initial_learning_rate = 0.1\n",
    "decay_steps = 10000\n",
    "decay_rate = 1/10\n",
    "\n",
    "global_step = tf.Variable(\n",
    "    0, trainable=False)\n",
    "\n",
    "learning_rate = tf.train.exponential_decay(\n",
    "    initial_learning_rate, \n",
    "    global_step,\n",
    "    decay_steps, \n",
    "    decay_rate)\n",
    "\n",
    "optimizer = tf.train.MomentumOptimizer(\n",
    "    learning_rate, \n",
    "    momentum=0.9)\n",
    "\n",
    "training_op = optimizer.minimize(\n",
    "    loss, \n",
    "    global_step=global_step)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regularization Techniques"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Early Stopping\n",
    "* Simply interrupt training when validation performance starts dropping."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### L1 & L2 Regularlization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dropout\n",
    "* Popular technique - typically adds 1-2% accuracy boost\n",
    "* At every training step, every neuron has probability (p) of being temporarily ignored\n",
    "* Typical p = 50%\n",
    "\n",
    "* In TF: apply *dropout()* to input layer & output of every hidden layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# in TF\n",
    "\n",
    "from tensorflow.contrib.layers import dropout\n",
    "\n",
    "[...]\n",
    "\n",
    "is_training = tf.placeholder(\n",
    "    tf.bool, \n",
    "    shape=(), \n",
    "    name='is_training')\n",
    "\n",
    "keep_prob = 0.5\n",
    "\n",
    "X_drop = dropout(\n",
    "    X, \n",
    "    keep_prob, \n",
    "    is_training=is_training)\n",
    "    \n",
    "hidden1      = fully_connected(\n",
    "    X_drop, n_hidden1, scope=\"hidden1\")\n",
    "    \n",
    "hidden1_drop = dropout(\n",
    "    hidden1, keep_prob, is_training=is_training)\n",
    "    \n",
    "hidden2      = fully_connected(\n",
    "    hidden1_drop, n_hidden2, scope=\"hidden2\")\n",
    "    \n",
    "hidden2_drop = dropout(\n",
    "    hidden2, keep_prob, is_training=is_training)\n",
    "    \n",
    "logits       = fully_connected(\n",
    "    hidden2_drop, n_outputs, \n",
    "    activation_fn=None,\n",
    "    scope=\"outputs\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Max-Norm Regularization\n",
    "* Each neuron's incoming weights are constrained such that ||w||2 <= r\n",
    "* r = *max-norm hyperparameter*\n",
    "* ||.|| = l2 norm\n",
    "* Reducing r increases regularization\n",
    "* Not implemented in TF, but doable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Augmentation\n",
    "* Generating new training instances from existing ones with learnable differences\n",
    "* ex: pics with shifts/rotates/resizes/flips/contrasts\n",
    "* TF has image manipulation ops built-in"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Practical Guidelines\n",
    "* Suggested default DNN configurations:\n",
    "    * Initialization: He\n",
    "    * Activation: ELU\n",
    "    * Normalization: Batch\n",
    "    * Regularization: Dropout\n",
    "    * Optimizer: Adam\n",
    "    * Learning Rate schedule: none"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
